Abstract

The genus Streptococcus comprises important pathogens that have a severe impact on human health and are responsible for 
substantial economic losses to agriculture. Here, we utilize 46 Streptococcus genome sequences (44 species), including eight species 
sequenced here, to provide the first genomic level insight into the evolutionary history and genetic basis underlying the functional 
diversity of all major groups of this genus. Gene gain/loss analysis revealed a dynamic pattern of genome evolution characterized by an 
initial period of gene gain followed by a period of loss, as the major groups within the genus diversified. This was followed by a period 
of genome expansion associated with the origins of the present extant species. The pattern is concordant with an emerging view that 
genomes evolve through a dynamic process of expansion and streamlining. A large proportion of the pan-genome has experienced 
lateral gene transfer (LGT) with causative factors, such as relatedness and shared environment, operating over different evolutionary 
scales. Multiple gene ontology terms were significantly enriched for each group, and mapping terms onto the phylogeny showed that 
those corresponding to genes born on branches leading to the major groups represented approximately one-fifth of those enriched. 
Furthermore, despite the extensive LGT, several biochemical characteristics have been retained since group formation, suggesting 
genomic cohesiveness through time, and that these characteristics may be fundamental to each group. For example, proteolysis: mitis 
group; urea metabolism: salivarius group; carbohydrate metabolism: pyogenic group; and transcription regulation: bovis group.
Introduction

The genus Streptococcus comprises approximately 72 species 
of Gram-positive bacteria including numerous species that 
have a severe impact on human health, inflicting significant 
morbidity and mortality (Kohler 2007). In addition, several 
species are responsible for substantial economic losses to ag- 
riculture. For example, Streptococcus pyogenes (Group A 
Streptococcus; GAS) is among the top ten causes of human 
mortality due to infectious disease, inflicting a wide range of
diseases that include necrotizing fasciitis, toxic shock syn-
drome, pharyngitis, impetigo, puerperal sepsis, scarlet fever,
glomerulonephritis, and rheumatic fever (Carapetis et al. 
2005; Ralph and Carapetis 2013). Streptococcus pneumoniae, 
despite being a common member of the normal microbial 
flora of the nose and throat, is the leading cause of bacterial 
disease worldwide, and in addition to common diseases such 
as otitis media, the pathogen is responsible for life-threatening 
sepsis, meningitis, and pneumonia (O'Brien and Nohynek
2003). Similarly, S. agalactiae (Group B Streptococcus) is a
commensal of the genital and gastrointestinal tract yet can 
cause severe invasive disease in adults and neonates (e.g., 
pneumonia, meningitis, and septicemia) (Baker 2000; Balter
et al. 2000; Dermer et al. 2004). Streptococcus mutans, an- 
other commensal, is implicated as a leading cause of tooth 
decay, and although not life threatening, the economic 
burden of treatment is substantial (Loesche 1986). Several 
Streptococcus species can cause bovine mastitis (e.g., 
S. uberis, S. agalactiae, S. dysgalactiae subsp. dysgalactiae,
and S. canis), with S. uberis and S. agalactiae responsible for
major economic loss to the dairy industry (Zadoks et al. 2011).

The species within the genus Streptococcus display a wide 
range of epidemiological and ecological characteristics. For 
example, many species are restricted to humans or a single 
animal host, for example, Streptococcus equi subsp. equi is
restricted to horses, whereas S. agalactiae infects multiple 
hosts ranging from humans to teleosts. Some species are re- 
garded as contagious (transmitted directly between hosts), 
whereas others are regarded as environmental (transmitted 
between the environment and host; for example, S. uberis 
can be transmitted from soil to cows). One species within 
the group (S. thermophilus) is nonpathogenic and is used ex-
tensively in the dairy industry (Price et al. 2011).

Early classification for Streptococcus was based primarily on 
hemolytic reaction and Lancefield group antigens, which di- 
vided the genus into two groups: the pyogenic and viridans 
(Sherman 1937). The pyogenic group was beta-hemolytic and
isolated from a range of human and animal sources, whereas 
the viridans group was mostly alpha-hemolytic and isolated 
predominantly from the human oral cavity. Subsequent to 
this, Bentley et al. (1991) and Kawamura et al. (1995) utilized 
16S rRNA sequences to divide the genus into six major groups.
The pyogenic group remained, but viridans was split into five 
subgroups whose names reflected one of the species within 
each of the groups: anginosus, mitis, salivarius, bovis, and 
mutans. It should be noted, however, that neither of these 
studies attempted to provide phylogenetic support for these 
groupings. Subsequently, Facklam (2002) devised an identifi- 
cation scheme based on phenotypic characteristics that could 
delineate species into these same groups plus an additional 
one that he named sanguinis. Combining sequence data from 
16S rRNA and the RNase P RNA gene (rnpB), Tapp et al. (2003)
explored phylogenetic relationships among 50 Streptococcus 
species. Although they recovered all the major groups except 
sanguinis (this group was paraphyletic), relationships among 
the groups were poorly resolved. Here, we make use of 46 
Streptococcus genome sequences (44 distinct species), includ- 
ing eight species for which genome sequences were previ- 
ously unavailable, to provide the first genomic level insight 
into the evolutionary history of all the major groups of this 
genus, as well as the genetic basis underlying their functional 
diversity. Toward a better understanding of the evolution of 
this functional diversity, we also examine gene gain/loss events
that occurred on all lineages through the course of their
evolution. 

Materials and Methods 

Sequence Data 

Details regarding the 46 genome sequences (47 including the 
outgroup, see below) used in our analyses are presented in 
table 1. Of these sequences, 39 were obtained directly from
National Center for Biotechnology Information (NCBI). The 
remaining eight were sequenced to near completion as part 
of this study. The following six were sequenced using a com-
bination of Roche 454 and Illumina sequencing technologies
and assembled using Celera Assembler v6.1 (Myers et al.
2000): S. criceti (HS-6), S. ictaluri (707-05), S. macacae
(NCTC 11558), S. porcinus (Jelinkova 176), S. pseudoporcinus
(LQ 940-04), and S. urinalis (2285-97). Streptococcus equi 
subsp. ruminatorum (CECT 5772) was sequenced using 
Roche 454 technology and assembled using Newbler v2.3. 
All of these genome sequences were annotated using the 
Prokaryotic Genome Automated Annotation Pipeline at 
NCBI. Streptococcus iniae (9117) was sequenced using 
Roche 454 technology and assembled using Newbler v1.1. 
In the case of S. iniae, gene prediction and manually curated
annotation were performed as described previously 
(Highlander et al. 2007). For two of these species, strains 
from different hosts were included: S. agalactiae (human
and bovine) and S. parauberis (bovine and flounder). The
origin of S. gordonii (Challis substr CH1) is probably human
blood or an endocarditis valve; however, this is unconfirmed 
(Vickerman M, personal communication). Lactobacillus crispa- 
tus, from the closely related nonpathogenic family 
Lactobacillaceae, was used as an outgroup (Price et al. 2011).

Gene Clustering and Phylogenetic Analyses 

Amino acid sequences were delineated into clusters with pu- 
tative shared homology using the Markov clustering (MCL) 
algorithm (van Dongen 2000) as implemented in the 
MCLBlastLINE pipeline (available at http://micans.org/mcl,
last accessed March 20, 2014). Throughout the article, we 
refer to a set of gene sequences delineated by this method 
as an MCL gene cluster. The pipeline uses MCL to assign gene 
sequences to clusters with putative shared homology based 
on a Blastp search between all pairs of protein sequences 
using an E-value cutoff of 1e-5. The MCL algorithm was im-
plemented using an inflation parameter of 1.8. Simulations
have shown this value to be generally robust to false positives 
and negatives (Brohee and van Helden 2006). Nucleotide se- 
quences corresponding to each MCL gene cluster were 
aligned using Probalign v1.1 (Roshan and Livesay 2006). 
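
The role of the inflation parameter can be illustrated with a toy version of the MCL iteration. The sketch below is not the mcl program or the MCLBlastLINE pipeline; it is a minimal dense-matrix illustration in which the function names, the added self-loops, the convergence tolerance, and the cluster read-out threshold are all assumptions made for the example. Expansion squares a column-stochastic matrix, and inflation raises each entry to the chosen power (1.8 here) and renormalizes.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise each entry to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=1.8, max_iter=100, tol=1e-9):
    """Toy MCL on a symmetric similarity matrix (e.g., derived from Blastp hits)."""
    n = len(similarity)
    # Assumption: self-loops are added before normalization.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Rows that keep non-negligible mass define clusters; duplicates collapse.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


if __name__ == "__main__":
    # Hypothetical similarity graph: proteins 0-2 hit each other, as do 3-4.
    toy = [[0, 1, 1, 0, 0],
           [1, 0, 1, 0, 0],
           [1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1],
           [0, 0, 0, 1, 0]]
    print(mcl(toy, inflation=1.8))
```

Larger inflation values generally split the graph into more, smaller clusters, which is why the choice of 1.8 matters for the number and size of MCL gene clusters.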

Table 1

Genome Sequence Details

Species | Source | Tissue/Presentation | Accession No.
Streptococcus agalactiae (A909) | Human | Neonate | NC_007432
S. agalactiae (FSL S3-026) | Bovine | Mastitis | AEXT01
S. anginosus (1_2_62CV) | Human | Rectal biopsy | ADME01
S. australis (ATCC 700641) | Human | Saliva | AEQR01
S. bovis (ATCC 700338) | Human | Synovial fluid | AEEL01
S. canis (FSL Z3-227) | Bovine | Mastitis | AIDX01
S. constellatus subsp. pharyngis (SK1060) | Human | Throat | AFUP01
S. criceti (HS-6) | Hamster | Caries lesion | AEUV02
S. cristatus (ATCC 51100) | Human | Periodontal abscess | AEVC01
S. downei (F0415) | Human | Oral cavity | AEKN01
S. dysgalactiae subsp. equisimilis (GGS 124) | Human | Toxic shock syndrome | NC_012891
S. dysgalactiae subsp. dysgalactiae (ATCC 27957) | Bovine | Mastitis | AEGO01
S. equi subsp. equi (4047) | Horse | Strangles | NC_012471
S. equi subsp. ruminatorum (CECT 5772) | Sheep | Mastitis | AWEX01
S. equi subsp. zooepidemicus (MGCS10565) | Human | Nephritis | NC_011134
S. equinus (ATCC 9812) | Human | Gastrointestinal tract | AEVB01
S. gallolyticus (UCN34) | Human | Blood | NC_013798
S. gordonii (Challis substr CH1 ATCC 35105) | Human? | Blood/endocarditis valve? | NC_009785
S. ictaluri (707-05) | Catfish | | AEUX02
S. infantarius subsp. infantarius (ATCC BAA-102) | Human | Feces | ABJK01
S. infantis (ATCC 700779) | Human | Tooth surface and pharynx | AEVD01
S. iniae (9117) | Human | Blood | AMOO01
S. intermedius (F0413) | Human | Dental plaque | AFXO01
S. macacae (NCTC 11558) | Macacae | Dental plaque | AEUW02
S. macedonicus (ACA-DC 198) | Cheese | | NC_016749
S. mitis (B6) | Human | Blood | NC_013853
S. mutans (UA159) | Human | Oral cavity | NC_004350
S. oralis (Uo5) | Human | Oral cavity | NC_015291
S. parasanguinis (ATCC 15912) | Human | Oral cavity | NC_015678
S. parauberis (KCTC 11537) | Flounder | | NC_015558
S. parauberis (NCFD 2020) | Bovine | Mastitis | AEUT01
S. pasteurianus (ATCC 43144) | Human | Blood | NC_015600
S. peroris (ATCC 700780) | Human | Tooth surface and pharynx | AEVF01
S. pneumoniae (670-6B) | Human | Nasopharyngeal | NC_014498
S. porcinus (Jelinkova 176) | Swine | Hemorrhagic lymph nodes | AEUU01
S. pseudopneumoniae (IS7493) | Human | Sputum; HIV | NC_015875
S. pseudoporcinus (LQ 940-04) | Human | Female genitourinary tract | AEUY02
S. pyogenes (MGAS10394) | Human | Throat | NC_006086
S. ratti (FA-1) | Rat | Oral cavity | AJTZ01
S. salivarius (JIM8780) | Human | Blood | NC_015760
S. sanguinis (SK36) | Human | Dental plaque | NC_009009
S. suis (05ZYH33) | Human | Toxic shock syndrome | NC_009442
S. thermophilus (CNRZ1066) | Yogurt | | NC_006449
S. uberis (0140J) | Bovine | Mastitis | NC_012004
S. urinalis (2285-97) | Human | Urine | AEUZ02
S. vestibularis (ATCC 49124) | Human | Oral cavity | AEVI01
Lactobacillus crispatus (ST1) | Chicken | Crop | NC_014106

For the phylogenetic analysis, we selected those MCL gene
clusters that were shared among all taxa and contained only
single gene copies for each taxon (the core set) (n=159)
(supplementary table S1, Supplementary Material online).
Recombination for genes in these clusters was assessed 
using a combination of four methods: 1) Genetic Algorithm 
Recombination Detection (GARD), 2) the Pairwise Homoplasy 
Index (PHI), 3) the Neighbor Similarity Score (NSS), and 4)
Maximum χ². GARD is a phylogenetic method that searches
for gene segments with incongruent phylogenetic topologies
(Kosakovsky Pond et al. 2006a). PHI and NSS are compatibility
methods that examine pairs of sites for homoplasy (Jakobsen
and Easteal 1996; Bruen et al. 2006). Maximum χ² is a
substitution distribution method that searches for significant
clustering of substitutions at putative recombination break 
points (Maynard Smith 1992). The tests were implemented 
using GARD (Kosakovsky Pond et al. 2006b) and PhiPack 
(Bruen et al. 2006). We compared two data sets for subse- 
quent analyses. In the first, MCL gene clusters showing evi- 
dence for recombination for all four methods were removed. 
In the second, gene clusters showing evidence for recombina- 
tion for at least three of the four methods were removed. 
Using the first approach, 23 clusters (14.5%) were removed 
leaving 136 clusters. Using the second approach, 84 clusters 
(52.8%) were removed leaving 75 clusters. For each data set, 
rooted maximum likelihood (ML) phylogenies (gene trees) 
were constructed from each MCL gene cluster using PhyML 
v3.0 (Guindon et al. 2010). Then, a species tree based on the 
consensus of the gene trees was constructed using the Triple 
Construction Method as implemented in the program Triplec 
(Ewing et al. 2008). This procedure is based on the observation 
that the most probable three-taxon tree consistently matches 
the species tree (Degnan and Rosenberg 2006). The method 
searches all input trees for the most frequent of the three 
possible rooted triples for each set of three taxa. Once 
found, the set of rooted triples are joined to form the consen- 
sus tree using the quartet puzzling heuristic (Strimmer and 
von Haeseler 1996). The method has been shown to outper-
form majority rule and greedy consensus methods (Degnan
et al. 2009). The species tree built using 136 MCL gene clus-
ters is shown in figure 1, and the species tree built using 75
clusters is shown in supplementary figure S1, Supplementary 
Material online. In general, the two trees shared the same 
topology. However, a notable discrepancy was that S. agalac-
tiae (a pyogenic species) clustered with the bovis species group
in the 75-cluster species tree. This grouping was supported by 
52.9% of the gene trees. The 75-cluster species tree also 
showed more conflict among the gene trees. Consequently, 
given the questionable placement of S. agalactiae in this tree 
coupled with its higher gene tree conflict, we elected to use 
the 136-cluster data set for subsequent analyses. The perfor- 
mance of gene tree consensus approaches, in particular the 
triple consensus approach, improves with the addition of 
more gene trees (Ewing et al. 2008). Consequently, it appears 
that for our data set, the benefit of adding additional genes
likely outweighed the possible confounding effect of an ac-
companying increase in recombination.
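
The voting step of the triple-based consensus can be sketched as follows. This is an illustration only and not the Triplec implementation: gene trees are assumed to be rooted and encoded as nested Python tuples with string leaf names, every gene tree is assumed to contain the same taxa, ties are broken arbitrarily, and the final step of joining the winning triples with quartet puzzling is omitted. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of internal-node leaf sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def rooted_triple(tree_clades, a, b, c):
    """Return the sister pair among a, b, c in one tree, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # x and y are sisters relative to z if some clade holds x and y but not z.
        if any(x in cl and y in cl and z not in cl for cl in tree_clades):
            return frozenset((x, y))
    return None


def most_frequent_triples(gene_trees):
    """For each set of three taxa, report the most frequent sister pair and its support."""
    taxa = sorted(clades(gene_trees[0])[0])
    per_tree = [clades(t)[1] for t in gene_trees]
    result = {}
    for a, b, c in combinations(taxa, 3):
        votes = Counter()
        for tree_clades in per_tree:
            pair = rooted_triple(tree_clades, a, b, c)
            if pair is not None:
                votes[pair] += 1
        if votes:
            pair, count = votes.most_common(1)[0]
            result[(a, b, c)] = (tuple(sorted(pair)), count / len(gene_trees))
    return result


if __name__ == "__main__":
    # Three hypothetical gene trees over four taxa.
    trees = [((("A", "B"), "C"), "D"),
             ((("A", "B"), "C"), "D"),
             ((("A", "C"), "B"), "D")]
    for triple, (pair, support) in most_frequent_triples(trees).items():
        print(triple, pair, round(support, 3))
```

The support fraction computed here is analogous in spirit to the proportion of gene trees supporting a grouping, although the values reported on the consensus tree come from the published software rather than from this sketch.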

We utilized a second phylogenetic approach, gene align- 
ment concatenation, to construct a species tree. This ap- 
proach has the benefit of providing branch lengths. 
Specifically, the 136 MCL gene cluster alignments were con- 
catenated and all invariant sites removed. The resulting align- 
ment (68,803 bp) was used to build an ML phylogeny using 
GARLI v2.0 (Zwickl 2006). The search was performed using 
the GTR + G substitution model, which was determined to be 
the best fit for the data using the Akaike Information Criterion 
in MODELTEST (Posada and Crandall 1998). GARLI's auto
termination option was set, allowing it to run until no signif-
icant improvement in topology was attained. Three search
replicates were performed. Branch support was provided by 
generating 200 bootstrap replicates. 
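
Concatenation followed by removal of invariant sites can be sketched in a few lines. The example assumes each alignment is a dictionary mapping taxon name to an aligned sequence of equal length, and it treats a gap as an ordinary character state; both are assumptions for illustration, and the function names are hypothetical rather than taken from any tool used in the study.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments (dict: taxon -> sequence) into one supermatrix."""
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}


def drop_invariant_sites(alignment):
    """Keep only the columns in which at least two taxa differ."""
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    keep = [i for i in range(length)
            if len({alignment[t][i] for t in taxa}) > 1]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


if __name__ == "__main__":
    # Two hypothetical gene alignments for three taxa.
    gene1 = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGT"}
    gene2 = {"sp1": "GG", "sp2": "GC", "sp3": "GG"}
    supermatrix = concatenate([gene1, gene2], ["sp1", "sp2", "sp3"])
    print(drop_invariant_sites(supermatrix))
```

Because constant columns are discarded, branch lengths estimated from such an alignment reflect only the variable sites.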

Gene Gain and Loss 

We assessed gene gain/loss on the species tree using the par- 
simony based gene tree species tree reconciliation approach 
implemented in the program AnGST (David and Alm 2011).
The reconciliation is obtained by inferring a minimum set of 
the following evolutionary events: gene loss, gene duplication, 
speciation, lateral gene transfer (LGT), and gene birth or gen- 
esis. We constrained gene transfer events to only occur be- 
tween contemporaneous lineages on the phylogeny. To 
enforce this option in AnGST, the species tree required 
branch lengths scaled in units of time. We therefore rescaled 
the branch lengths on the concatenation species tree to units 
of time for this analysis. This rescaling was performed using a 
semiparametric method based on the penalized likelihood 
method of Sanderson (2002) as implemented in the chronopl 
function within the ape R package (Paradis et al. 2004). To 
obtain more accurate estimates for the ingroups, the out-
group and its long connecting branch were removed
(Magallon and Sanderson 2005). The optimum value for the 
likelihood smoothing parameter (lambda) was determined 
using cross validation of lambda values from 10^-4 to 10^6.
Although Bayesian phylogenetic approaches might produce 
more accurate estimates of time scaled branch lengths, the 
penalized likelihood method was deemed suitable given the 
high computational demands a Bayesian approach would 
likely require to attain convergence on this amount of data 
and that precise dating of nodes in the phylogeny was not our 
objective here. 

Gene trees for all MCL gene clusters containing three or 
more genes were constructed using PhyML v3.0 with the 
GTR + I + G substitution model. AnGST can account for 
gene tree phylogenetic uncertainty by creating a "chimeric" 
gene tree via the amalgamation of bootstrap replicates. We 
utilized this option by providing AnGST with 500 bootstrap 
replicates for every gene tree that contained three or more 
taxa. One MCL gene cluster for sequences annotated as an 
ATP-binding cassette (ABC) transporter protein contained a 
particularly large number of sequences (1,087). Phylogenetic 
analysis of this cluster was mostly unresolved, resulting in a 
large polytomy. This cluster was excluded from the AnGST 
analysis. For MCL gene clusters containing two genes (dou- 
blets), a gene could be seen either once in two separate taxa 
or twice within one taxon. For these genes, a gene tree is
uninformative. Therefore, we followed Kamneva et al. 
(2012) and attempted to explain the evolutionary history of 
these genes without a gene tree by overlaying the position 
of the two genes onto the species tree and using the same set 
of evolutionary events and event penalties (see later) used in
the reconciliation procedure to calculate the most parsimoni-
ous evolutionary scenario of gene birth followed by vertical
transmission. LGT was inferred if the number of losses resulted
in a higher penalty for loss than LGT. In this scenario, we
counted one birth and one LGT. However, the direction
of this transfer was unknown. If the two genes were from
the same genome, we counted one birth and one duplication.
It should be noted, however, that gene gain/loss analyses
are limited to the genomes included and that some births
may be the result of a LGT from a genome not included in
the analysis.

Phylogenomics and Dynamic Genome Evolution of Streptococcus
Genome Biol. Evol. 6(4):741-753. doi:10.1093/gbe/evu048 Advance Access publication March 12, 2014

Fig. 1. Phylogenies derived from a core set of 136 genes. Left: Consensus of the phylogenetic signal from each gene (numbers on branches show the
proportion of genes that support a particular grouping). Right: ML phylogeny derived from a concatenation of the genes (numbers on branches show
bootstrap support). Each of the eight major groups is color shaded. The putative downei group is shown with a dashed line. Previous nomenclature for
species within the bovis group is shown within parentheses.



AnGST event penalties were first determined by David and 
Alm (2011) using an approach that selected the combination
of penalties that minimized the average change in genome 
size between ancestor and descendant within the species tree 
(genome flux). Specifically, their approach fixed the loss pen-
alty to 1.0 and then ran multiple reconciliations adjusting LGT
and duplication penalties. For a wide range of eukaryote, ar- 
chaeal, and bacterial taxa, they found that a LGT penalty of 
3.0 and a duplication penalty of 2.0 minimized genome flux. 
More recently, Kamneva et al. (2012) determined event pen-
alties using the same approach for a data set containing only
bacteria from the Planctomycetes, Verrucomicrobia, and
Chlamydiae (PVC) phyla. Their penalties were LGT = 5.0 and 
duplication = 3.0. Both studies showed that the LGT penalty 
had the strongest effect on genome flux. We compared rec- 
onciliations using both sets of penalties described earlier and 
also a reconciliation using equal penalties (LGT = 1.0,
duplication = 1.0). 
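
The way the penalties decide between loss and transfer for a doublet can be sketched as follows. This is a simplified illustration and not the AnGST algorithm: it assumes a rooted, strictly bifurcating species tree encoded as nested tuples, places the gene birth at the most recent common ancestor of the two taxa, counts one loss for every lineage that branches off the paths to those taxa, and ignores the constraint that transfers occur between contemporaneous lineages. The function names and return labels are hypothetical; the default penalties are the loss = 1.0 and LGT = 3.0 values discussed above.

```python
def path_to(tree, leaf):
    """Return the list of nodes from the root of `tree` down to `leaf`, or None."""
    if tree == leaf:
        return [tree]
    if isinstance(tree, tuple):
        for child in tree:
            sub = path_to(child, leaf)
            if sub is not None:
                return [tree] + sub
    return None


def losses_under_vertical_descent(species_tree, taxon_a, taxon_b):
    """Losses needed if the gene arose at the MRCA and was only inherited vertically."""
    pa = path_to(species_tree, taxon_a)
    pb = path_to(species_tree, taxon_b)
    # k becomes the index of the first node at which the two paths differ.
    k = 0
    while k < min(len(pa), len(pb)) and pa[k] is pb[k]:
        k += 1
    losses = 0
    for path in (pa, pb):
        # Internal nodes strictly below the MRCA: each off-path child is one loss.
        for node in path[k:-1]:
            losses += len(node) - 1
    return losses


def explain_doublet(species_tree, taxon_a, taxon_b,
                    loss_penalty=1.0, lgt_penalty=3.0):
    """Classify a two-gene cluster as duplication, LGT, or vertical descent."""
    if taxon_a == taxon_b:
        return ("birth + duplication", 0)
    losses = losses_under_vertical_descent(species_tree, taxon_a, taxon_b)
    if losses * loss_penalty > lgt_penalty:
        return ("birth + LGT", losses)
    return ("birth + vertical descent", losses)


if __name__ == "__main__":
    # Hypothetical five-taxon species tree.
    tree = ((("A", "B"), "C"), ("D", "E"))
    print(explain_doublet(tree, "A", "D"))
    print(explain_doublet(tree, "A", "D", lgt_penalty=1.0))
```

Raising the LGT penalty makes the loss-based explanation win more often, which is consistent with the observation that the LGT penalty has the strongest effect on genome flux.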

Hierarchical Clustering and Enrichment

Hierarchical clustering among genomes using presence/absence
of MCL gene clusters was performed using the complete link- 
age method and binary distances as implemented in the R 
package pvclust (Suzuki and Shimodaira 2006). Support for 
groupings was obtained by calculating approximately unbi- 
ased P values using 500 bootstrap replicates. 
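
The distance and linkage used for this clustering can be illustrated with a small sketch. It assumes that "binary distance" has the meaning used by R's dist function (the proportion of positions in which only one profile has the cluster, among positions where at least one does), and it implements plain complete-linkage agglomeration. The bootstrap-based approximately unbiased P values computed by pvclust are not reproduced, and returning 0.0 for two all-absent profiles is a choice made here for simplicity.

```python
def binary_distance(x, y):
    """Proportion of mismatches among positions where at least one profile is 'on'."""
    either = sum(1 for a, b in zip(x, y) if a or b)
    only_one = sum(1 for a, b in zip(x, y) if bool(a) != bool(b))
    return only_one / either if either else 0.0


def complete_linkage(profiles):
    """Agglomerate genomes; return the merges as (cluster1, cluster2, distance)."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def dist(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(binary_distance(profiles[a], profiles[b])
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], dist(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


if __name__ == "__main__":
    # Hypothetical presence/absence of four MCL gene clusters in three genomes.
    profiles = {"g1": [1, 1, 1, 0], "g2": [1, 1, 0, 0], "g3": [0, 0, 1, 1]}
    for merge in complete_linkage(profiles):
        print(merge)
```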

Gene Ontology (GO) terms were assigned to all 
Streptococcus genomes using Blast2GO v.2.5.0 (Gotz et al. 
2008). Relative enrichment (overrepresentation) of GO terms 
among lineages was assessed using Fisher exact tests. The test 
was performed using the Gossip statistical package (Bluthgen 
et al. 2005) implemented within Blast2GO. The false-discovery 
rate procedure of Benjamini and Hochberg (1995) was used to
correct for multiple hypothesis testing (FDR = 0.05). 
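
The enrichment test and the multiple-testing correction can be sketched together. The example below is not the Gossip or Blast2GO code: it assumes a one-sided Fisher exact test for overrepresentation, computed from the hypergeometric distribution, and applies the Benjamini-Hochberg step-up rule at FDR = 0.05. The argument names and the example counts are hypothetical.

```python
from math import comb


def fisher_overrepresentation(k, n, K, N):
    """One-sided P(X >= k): N genes in total, K carry the GO term,
    n genes are in the test set, and k of those carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans marking the hypotheses rejected at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p value passes its threshold.
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical GO term: 3 of 3 test-set genes annotated, 3 of 6 genes overall.
    print(fisher_overrepresentation(k=3, n=3, K=3, N=6))
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

With many GO terms tested per lineage, the correction matters: a term with a raw P value just under 0.05 is typically not retained unless enough other terms have smaller P values.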

Results and Discussion 

Genome Sequencing 

The number of contigs, genome length, number of CDS, 
rRNAs, tRNAs, and %GC for each of the eight genomes se- 
quenced as part of this study are shown in table 2. With the 
exception of S. equi subsp. ruminatorum (CECT 5772) (133 
contigs), these genomes were assembled to a high level of 
contiguity (average contig number = 3, range = 1-8), gener- 
ally consistent with bacterial genomes categorized as 
"noncontiguous finished" (Chain et al. 2009). 

Phylogenetic Relationships 

In general, the consensus and concatenation species trees 
were well supported and concordant (fig. 1). Both trees re-
covered the pyogenic, bovis, salivarius, and anginosus groups
described in previous phylogenetic studies (Bentley et al. 1991;
Kawamura et al. 1995; Tapp et al. 2003) and also showed the
sanguinis group to be monophyletic. All these groups were 
well supported. Recognizing the sanguinis group as a distinct 
monophyletic entity subsequently renders the eight mitis spe- 
cies a monophyletic grouping that is also well supported. 
Streptococcus parasanguinis fell within the mitis group and 
not within the sanguinis group with S. sanguinis, confirming
previous studies that showed these two species not to be sister 
taxa (Bentley et al. 1991; Kawamura et al. 1999; Tapp et al. 
2003). Monophyly for the five species previously included in 
the mutans group, however, was not well supported. 
Although S. mutans, S. ratti, and S. macacae all formed a 
well-supported clade, S. downei and S. criceti, which also
grouped tightly, clustered instead with the three salivarius 
species. However, inclusion of these two species in the salivar- 
ius group does not appear justified as the grouping was 
weakly supported in both trees, and these two species were 
distantly related to the three salivarius species, which were 
connected by relatively short branch lengths and have been 
shown to be phenotypically distinct. We suggest that 
S. downei and S. criceti might best be considered as compris- 
ing a separate taxonomic assemblage, which we tentatively 
propose as the downei group. At the tip of the trees, there 
were two minor discrepancies. Within the sanguinis group, 
the position of S. gordonii and S. cristatus was switched,
and within the pyogenic group, the position of S. pyogenes 
and S. canis was switched. At the base of the trees, relation- 
ships among the salivarius, downei, mutans, pyogenic, and 
bovis groups were poorly resolved, possibly reflecting the 
effect of frequent LGT during the early diversification of 
these groups. Relationships among the mitis, sanguinis, and 
anginosus groups, however, were well supported. 

Clustering and Gene Gain/Loss Analyses 

Table 2

Genome Characteristics for the Eight Streptococcus Species Sequenced as Part of This Study

Species | Contigs | Base Pairs | CDS | rRNA | tRNA | %GC
Streptococcus porcinus (str. Jelinkova 176) | 1 | 2,025,881 | 1,956 | 15 | 59 | 36.8
S. macacae (NCTC 11558) | 1 | 1,916,985 | 1,871 | 15 | 65 | 41
S. urinalis (2285-97) | 1 | 2,130,431 | 2,207 | 14 | 60 | 37.8
S. criceti (HS-6) | 2 | 2,417,851 | 2,282 | 15 | 61 | 42.2
S. iniae (9117) | 3 | 1,246,519 | 2,025 | 10 | 55 | 36.8
S. pseudoporcinus (LQ 940-04) | 5 | 1,816,306 | 2,027 | 15 | 57 | 37.1
S. ictaluri (707-05) | 8 | 2,234,402 | 2,473 | 9 | 45 | 38.2
S. equi subsp. ruminatorum (CECT 5772) | 133 | 2,140,742 | 2,127 | 4 | 43 | 41.4

We identified 19,000 MCL clusters containing a total of
96,242 gene copies. Of these clusters, 4,322 (22.7%) con-
tained three or more gene copies, 1,753 (9.2%) contained
two gene copies (doublets), and 12,925 (68.0%) contained
one gene copy (singletons). Kamneva et al. (2012) reported
similar proportions for PVC bacteria: three or more gene cop-
ies = 19.4%, doublets = 10.4%, singletons = 70.2%.
Genome flux was minimized using a LGT penalty of 3.0 and 
a duplication penalty of 2.0 (the default penalties established 
by David and Alm [2011]). Using these penalties, we detected
a dynamic pattern of gene gain/loss through the phylogeny. 
The early evolution of Streptococcus (the branches prior to the 
eight major groups) was characterized by more gene gain 
than loss (balance of gains and losses = 444) (fig. 2), whereas 
evolution for each of the major groups including the branches 
leading to each group and excluding the terminal branches 
was characterized in general by more gene loss than gain 
(overall balance = -1,210) (fig. 2 and table 3). Finally, the 
terminal branches were characterized by much higher gain 
than loss (overall balance = 12,265). 
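
The overall balances quoted in this paragraph can be checked directly against the per-group gains and losses reported in table 3. The short script below simply re-adds those published values; the dictionary layout and function name are illustrative choices.

```python
# (gains, losses) per group, copied from table 3.
# Internal: branches leading to and within each group, excluding terminal branches.
internal = {
    "Mitis": (1059, 1208), "Sanguinis": (343, 228), "Anginosus": (187, 458),
    "Salivarius": (397, 540), "Downei": (236, 233), "Mutans": (268, 364),
    "Pyogenic": (2476, 2951), "Bovis": (657, 851),
}
# Terminal: terminal branches of each group.
terminal = {
    "Mitis": (3375, 1504), "Sanguinis": (1532, 1007), "Anginosus": (2202, 774),
    "Salivarius": (1604, 454), "Downei": (1384, 268), "Mutans": (1376, 700),
    "Pyogenic": (7935, 3866), "Bovis": (2587, 1157),
}


def overall_balance(table):
    """Sum of (gains - losses) over all groups."""
    return sum(gains - losses for gains, losses in table.values())


if __name__ == "__main__":
    print("internal balance:", overall_balance(internal))
    print("terminal balance:", overall_balance(terminal))
    # Cluster size classes reported in the text should add up to all MCL clusters.
    print("clusters:", 4322 + 1753 + 12925)
```

Only the sanguinis and downei groups show a positive balance on the internal branches, so the net loss for that period is driven largely by the pyogenic, anginosus, and bovis groups.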



Fig. 2. Result of gene gain/loss analysis. Boxes on nodes and tips of the phylogeny show genome size (genome size calculations exclude one MCL gene
cluster that was excluded from the analysis; see Materials and Methods). Numbers on branches show the number of gains, losses, and the overall balance
(gain-loss: balance), with red indicating an overall gain and blue an overall loss. Color shading for each major Streptococcus group follows figure 1.

Table 3

Gene Gain/Loss Summary

Group | Gains | Losses | Balance
Mitis (a) | 1,059 | 1,208 | -149
Sanguinis (a) | 343 | 228 | 115
Anginosus (a) | 187 | 458 | -271
Salivarius (a) | 397 | 540 | -143
Downei (a) | 236 | 233 | 3
Mutans (a) | 268 | 364 | -96
Pyogenic (a) | 2,476 | 2,951 | -475
Bovis (a) | 657 | 851 | -194
Mitis (b) | 3,375 | 1,504 | 1,871
Sanguinis (b) | 1,532 | 1,007 | 525
Anginosus (b) | 2,202 | 774 | 1,428
Salivarius (b) | 1,604 | 454 | 1,150
Downei (b) | 1,384 | 268 | 1,116
Mutans (b) | 1,376 | 700 | 676
Pyogenic (b) | 7,935 | 3,866 | 4,069
Bovis (b) | 2,587 | 1,157 | 1,430

(a) Gain/loss for each group including the branches leading to each group and
excluding the terminal branches.

(b) Gain/loss for terminal branches for each group.

These results are similar in part to several previous
studies. For example, the weighted parsimony gene gain/
loss study of Makarova et al. (2006) for 11 Lactobacillaceae/
Leuconostocaceae species that are closely related to
Streptococcus (Price et al. 2011) showed an overall pattern
of gene loss. Similarly, using AnGST, Kamneva et al. (2012)
showed a general pattern of genome shrinkage for PVC bac-
teria. Using a probabilistic gene birth-and-death model to in-
vestigate the evolution of Archaea, Csuros and Miklos (2009)
also showed a pattern of overall gene loss. However, this pat-
tern was additionally punctuated by periods of gene gain.
They suggested a scenario where populations successful at
occupying new environments due to a major physiological
or metabolic invention then underwent genomic streamlining
(gene loss) as they diversified. More recently, using a model of
virtual cells evolving to maintain homeostasis, Cuypers and
Hogeweg (2012) showed a similar general pattern of 
genome evolution where genomes undergo an initial period 
of expansion as they adapt to new environments followed by 
a longer period of streamlining where genes redundant in the 
new environment are lost. Over the broad timescale of our 
data, our results are similar to the latter studies, showing an 
initial period of genome expansion as the major groups began 
to diversify. Then, after the groups formed, genomes in gen- 
eral experienced a period of reductive evolution (streamlining). 
Finally, there was a more recent period of genome expansion 
as the majority of present day species evolved. Also using 
AnGST, David and Alm (2011) showed a similar pattern over
an even larger time scale for the three domains of life, with an 
initial period of expansion during the Archaean followed by a 
period of streamlining. However, they did not show a more 
recent genomic expansion for prokaryotes. As highlighted by 
the authors, this discrepancy is likely explained by the fact that 
their analysis did not include singletons. In our analysis, single- 
tons were responsible for the majority of gains on the terminal 
branches. For example, the overall balance of gains and losses 
on the terminal branches would be reduced from 12,265 to
704 if the singletons were removed. However, this high 
number of singletons should be considered in the context 
that, for the most part, our phylogeny only contained a 
single strain for each species. If it had been possible to include 
multiple strains for each species, we would have been able to 
better assess their dispensable genomes and therefore detect 
more gene losses; the overall proportion of singletons would 
have decreased, and the proportion of losses toward the ter- 
minal branches of the phylogeny would have increased (e.g., 
Lefebure and Stanhope 2007). Nevertheless, there is still an
overwhelming pattern of high gain on the terminal branches,
which has been reported numerous times for a wide range of 
bacterial species including those within the genus 
Streptococcus (Hao and Golding 2004; Hao and Golding 
2006; Marri et al. 2006, 2007; Lefebure and Stanhope 
2007; Kamneva et al. 2012). Additionally, several of these 
studies (including two that focused on Streptococcus) pro- 
vided evidence that laterally acquired genes on terminal 
branches had a role in adaptation (Hao and Golding 2004; 
Hao and Golding 2006; Marri et al. 2006, 2007; Lefebure and 
Stanhope 2007; Kamneva et al. 2012). 
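
The totals quoted above can be checked against Table 3. The following is a minimal sketch for illustration only: it recomputes each balance as gains minus losses and sums the balances over the internal and terminal branches. The variable and function names are ours, and the singleton contribution is simply inferred as the difference between the two terminal-branch totals given in the text (12,265 and 704).

```python
# (gains, losses) per group, copied from Table 3.
INTERNAL = {  # branches leading to and within each group, excluding terminal branches
    "mitis": (1059, 1208), "sanguinis": (343, 228), "anginosus": (187, 458),
    "salivarius": (397, 540), "downei": (236, 233), "mutans": (268, 364),
    "pyogenic": (2476, 2951), "bovis": (657, 851),
}
TERMINAL = {  # terminal branches for each group
    "mitis": (3375, 1504), "sanguinis": (1532, 1007), "anginosus": (2202, 774),
    "salivarius": (1604, 454), "downei": (1384, 268), "mutans": (1376, 700),
    "pyogenic": (7935, 3866), "bovis": (2587, 1157),
}


def total_balance(table):
    """Sum of (gains - losses) over all groups in the table."""
    return sum(gains - losses for gains, losses in table.values())


internal_total = total_balance(INTERNAL)   # net loss on internal branches
terminal_total = total_balance(TERMINAL)   # net gain on terminal branches

# Inferred from the text: the balance falls to 704 once singletons are removed.
singleton_contribution = terminal_total - 704

print(internal_total, terminal_total, singleton_contribution)
```

The terminal-branch total reproduces the 12,265 quoted above, whereas the internal branches sum to a net loss, which is consistent with the streamlining phase described in the text.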

Of the genes gained on the terminal
branches, 6.8% were identified as LGT (see supplementary
fig. S2, Supplementary Material online, for a breakdown of 
all evolutionary events). The majority of the remainder were 
gains classified as genes born on these branches (singletons). 
However, given that these genes were not present in any of 
the remaining taxa, it is possible that these genes were ac- 
quired via LGT from bacteria not included in our analysis. This 
possibility was explored by Lefebure et al. (2012), who performed
a similar AnGST analysis on 15 Streptococcus species
and showed that approximately two-thirds of the genes born
on the S. pyogenes branch were likely LGTs from species not
included in their analysis (they had significant BLAST hits against
the NCBI nr database). Although the remainder may have
been born de novo on this branch, it is also possible that ho- 
mologous genes have yet to be sequenced. Overall, our find- 
ings suggest that a large proportion of the Streptococcus pan- 
genome (all MCL gene clusters) has been involved in LGT. For 
example, 18.0% of all MCL clusters were directly identified as
being involved in LGT and 68.0% of all clusters were single- 
tons, many of which are likely to be LGTs. Assuming that two- 
thirds of the genes born on terminal branches are actually 
LGTs, this suggests that over 60% of the pan-genome has 
been subject to LGT. If we assume that all the genes born on 
the terminal branches are LGTs, it suggests that over 80% of 
the pan-genome has been subject to LGT. 
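
The two bounds follow from simple arithmetic on the percentages given above. The sketch below is illustrative and rests on two assumptions that we state explicitly: the directly identified LGT clusters and the singleton clusters do not overlap, and the singletons stand in for the genes born on terminal branches. The function name is hypothetical.

```python
DIRECT_LGT = 18.0   # % of MCL clusters directly identified as LGT (from the text)
SINGLETONS = 68.0   # % of MCL clusters that are singletons (from the text)


def lgt_share(fraction_of_singletons_from_lgt):
    """Percentage of the pan-genome subject to LGT under a given assumption.

    Assumes the directly identified LGT clusters and the singletons are disjoint.
    """
    return DIRECT_LGT + SINGLETONS * fraction_of_singletons_from_lgt


two_thirds_case = lgt_share(2 / 3)   # two-thirds of singletons are LGTs
upper_bound = lgt_share(1.0)         # all singletons are LGTs

print(round(two_thirds_case, 1), upper_bound)
```

Under these assumptions the two cases give roughly 63% and 86%, matching the "over 60%" and "over 80%" statements.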

From the perspective of gene gain/loss (turnover) for individual
species, S. constellatus subsp. pharyngis is notable, as
this taxon showed considerably more turnover than any other
(gain = 1,408, loss = 503) (fig. 2), suggesting that perhaps this
species was shifting or expanding its niche (Hao and Golding 
2006; Marri et al. 2007). This high turnover was also reflected 
in the hierarchal clustering analysis (presence/absence of MCL 
gene clusters) (fig. 3) where this species was placed as an 
outlier to all remaining Streptococcus species. Similarly, the 
important human pathogens S. pyogenes and S. pneumoniae
also showed high gene turnover, ranking 6th and 9th, respectively.
In contrast, the two strains of S. agalactiae showed
considerably less turnover ranking 21st (bovine isolate) and 
40th (human isolate). The higher turnover for the bovine iso- 
late compared with the human isolate might reflect a more 
recent adaptation to the bovine environment. Similarly, S. parauberis,
which was traditionally associated with the bovine 
Fig. 3. Hierarchal clustering among genomes using presence/absence
of MCL gene clusters. Approximately unbiased P values are
shown on branches. Color shading for each major Streptococcus group
follows figure 1.



environment (Williams and Collins 1990), has recently been 
identified as an emerging fish pathogen (Nho et al. 2011), and
the fish isolate (19th) showed considerably more turnover 
than the bovine isolate (38th). Streptococcus suis, which re- 
mained an outlier to the mitis, sanguinis, and anginosus 
groups, also showed particularly high turnover (gain = 1,020,
loss = 380, ranking 2nd). This species is a major porcine path- 
ogen; however, the species has recently been identified as a 
particularly virulent emerging zoonotic pathogen and as an 
etiological agent for streptococcal toxic shock syndrome 
(STSS) (Lun et al. 2007). The S. suis strain included in our 
analysis (98HAH33) was isolated from a fatal case of STSS 
from an outbreak in China (Chen et al. 2007). Again, the 
high gene turnover for this species may reflect its recent ad- 
aptation to the human environment. 
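
The rankings above order taxa by gene turnover. As a rough sketch, and assuming that turnover is simply the total number of gain and loss events on a taxon's terminal branch (the text does not give the formula in this section), the two taxa whose counts are quoted can be compared as follows. Only these two taxa are included; the remaining values would come from fig. 2.

```python
# (gain, loss) values quoted in the text for the two highest-turnover taxa.
events = {
    "S. constellatus subsp. pharyngis": (1408, 503),
    "S. suis": (1020, 380),
}


def turnover(gain, loss):
    # Assumption: turnover is the sum of gain and loss events.
    return gain + loss


# Taxa ordered from highest to lowest turnover.
ranked = sorted(events, key=lambda taxon: turnover(*events[taxon]), reverse=True)

for rank, taxon in enumerate(ranked, start=1):
    print(rank, taxon, turnover(*events[taxon]))
```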

A pattern of rapid adaptive radiation is perhaps best illus- 
trated by the pyogenic group where there is little correlation 
between evolutionary relationship and host species (fig. 1). For
example, the S. equi subspecies are all separated by relatively
short branch lengths yet have distinct niches: S. equi subsp.
equi is typically restricted to horses, S. equi subsp. ruminatorum
has been isolated from mastitic sheep and goats, and
S. equi subsp. zooepidemicus can infect a wide range of hosts,
which suggests a recent diversification into these varied niches.
Similarly, S. dysgalactiae subsp. equisimilis is typically isolated
from the human environment, whereas S. dysgalactiae subsp. 
dysgalactiae is typically isolated from the bovine environment. 
There are also human and bovine ecotypes within S. agalac- 
tiae and bovine and fish ecotypes within S. parauberis, sug- 
gesting more recent adaptation. It is likely that LGT is a major 
evolutionary mechanism responsible for this rapid adaptation 
(Alm et al. 2006; Marri et al. 2006, 2007). Both genome relatedness
and physical proximity are probable major factors
affecting the frequency of this LGT. 

With the exception of S. parasanguinis and S. australis,
which clustered with the sanguinis group, and S. constellatus 
subsp. pharyngis, which was an outlier to all Streptococcus 
species, the hierarchal clustering analysis recovered all the 
major groups (fig. 3). However, relationships within the 
groups were not concordant with the species tree, suggesting 
that LGT is more likely to occur within the groups than be- 
tween them. Lack of concordance with the species tree within 
the groups also suggests that relatedness has less of an influ- 
ence on LGT and that other factors such as shared environ- 
ment may be playing a role. Indeed, previous studies have 
provided good evidence for LGT among numerous 
Streptococcus species within a shared bovine environment 
(Richards et al. 2011, 2012). Furthermore, our analysis in- 
cluded five distantly related pyogenic species isolated from 
the bovine environment, and we detected LGT between ten 
different species-pair combinations involving those taxa. From 
the perspective of the genus as a whole, a factor contributing 
to the strong support for the major groups in the hierarchal 
clustering analysis might be the effect of group-specific gene 
loss (streamlining). Consequently, the combined effect of the 
reduced likelihood of intergroup LGT and group-specific 
streamlining may have contributed to genomic cohesiveness 
of the major taxonomic groups throughout the evolution of 
Streptococcus. 
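
To make the presence/absence comparison concrete, the sketch below computes pairwise distances between gene-cluster profiles. It is a toy illustration under stated assumptions: the three taxa and their profiles are hypothetical, and the Jaccard distance is our choice for the example, since the distance measure and linkage method used for fig. 3 are not given in this section.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) profiles across five MCL gene clusters.
profiles = {
    "taxon_A": [1, 1, 0, 1, 0],
    "taxon_B": [1, 1, 0, 0, 0],
    "taxon_C": [0, 0, 1, 1, 1],
}


def jaccard_distance(x, y):
    """1 - (clusters present in both) / (clusters present in either)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either if either else 0.0


# Distance for every pair of taxa.
distances = {
    (p, q): jaccard_distance(profiles[p], profiles[q])
    for p, q in combinations(profiles, 2)
}

# The pair with the most similar gene content would be joined first
# in an agglomerative clustering.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

Taxa that share group-specific gains and losses have similar profiles and therefore small distances, which is one way that streamlining could reinforce the recovery of the major groups.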

GO Term Enrichment and Genes Born on Branches 
Leading to Major Streptococcus Groups 

Each of the eight phylogenetic groups was tested for enrich- 
ment of GO terms relative to the other groups. All groups 
showed enrichment for at least one of the three GO term 
domains of biological process (P), molecular function (F), and 
cellular component (C) (supplementary table S2, 
Supplementary Material online). The number of terms enriched
for each group was as follows: mitis = 98; sanguinis = 17;
anginosus = 6; salivarius = 34; downei = 6;
mutans = 25; pyogenic = 241; and bovis = 198. For each
GO term attached to a particular gene, there are typically 
more general parent terms connected to it. The GO can be
represented by a directed acyclic graph (DAG), and the parent
terms become increasingly general moving up through
the DAG levels. For the enrichment test, the term count for a
particular gene includes all parent terms in the DAG
(Bluthgen et al. 2005). Reducing the terms to their most specific
reduced the term count as follows: mitis = 47; sanguinis = 7;
anginosus = 1; salivarius = 18; downei = 1;
mutans = 5; pyogenic = 120; and bovis = 64 (supplementary
table S3, Supplementary Material online).
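
The two counting schemes described above (terms including all parents vs. most specific terms only) can be sketched on a toy DAG. This is an illustration rather than the procedure used in our analysis: the DAG, the term names, and the function names are hypothetical, and real GO terms can have several parents and relation types.

```python
# Hypothetical toy DAG: each term maps to its direct parent terms.
PARENTS = {
    "root": [],
    "metabolic_process": ["root"],
    "catalytic_activity": ["root"],
    "proteolysis": ["metabolic_process"],
    "urea_metabolic_process": ["metabolic_process"],
}


def ancestors(term):
    """All parent terms reachable from `term` in the DAG (excluding itself)."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def with_parents(terms):
    """Expand annotated terms to include every parent term."""
    expanded = set(terms)
    for term in terms:
        expanded |= ancestors(term)
    return expanded


def most_specific(terms):
    """Drop any term that is an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term)
    return terms - implied


expanded = with_parents({"proteolysis"})
reduced = most_specific(expanded)
print(sorted(expanded), sorted(reduced))
```

Expanding a single annotation yields three terms in this toy DAG, and reducing the expanded set recovers only the original, most specific term, which is why the reduced counts are substantially lower.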

To gain perspective on those genes that have likely been 
important for the evolution of each of the eight phylogenetic 
groups, we delineated the GO terms for the genes born on the 
branches leading to each group (supplementary table S3, 
Supplementary Material online). We then compared the 
terms on each group's branch to those that were enriched 
(most specific terms) for the same group. We omitted cellular 
component terms, as there were relatively few or no terms 
from this domain enriched for each group. In the discussion 
that follows, we primarily focused on terms that were both 
born on a group's branch and also enriched for the group. We 
acknowledge the obvious caveat that our analysis does not 
contain genome sequences for all Streptococcus species. 
Nevertheless, we provide the most complete assessment of 
the evolution of Streptococcus biochemical characteristics to 
date. 
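
The comparison between enriched terms and terms born on a group's branch amounts to a set intersection. The sketch below illustrates this using only the mitis group terms named in the section that follows, not the full supplementary tables; the variable names are ours.

```python
# Terms enriched for the mitis group, as named in the text.
enriched = {
    "proteolysis",
    "response to antibiotic",
    "antibiotic transport",
    "N-acetyltransferase activity",
}

# Enriched terms that also annotate genes born on the mitis branch (from the text).
born_on_branch = {"proteolysis", "N-acetyltransferase activity"}

# Enriched and born on the branch: candidates for characteristics retained
# since group formation.
retained = enriched & born_on_branch

# Enriched but not born on the branch: possibly acquired more recently.
acquired_later = enriched - born_on_branch

print(sorted(retained))
print(sorted(acquired_later))
```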

The Mitis Group 

The mitis group showed enrichment for proteolysis, with 
genes annotated with this term occurring in the highest fre- 
quency. Genes with this term were distributed fairly evenly 
among species within the group (the number of genes assigned
the term for each species ranged from 39 to 67 [average
55]). The mitis group is primarily composed of
commensal organisms of the upper respiratory tract and pio- 
neer colonizers of dental plaque. Proteases have important 
roles in both these environments. For example, in plaque bio- 
films, bacteria utilize proteases to exploit salivary proteins as a 
nutrient source (Bradshaw et al. 1994; Wickstrom et al. 2009),
and many species of pathogenic bacteria secrete proteases 
that interfere with host defenses and/or damage host cells 
or tissues (Harrington 1996; Miyoshi and Shinoda 2000; 
Potempa et al. 2000). Proteases can also aid bacterial spread
and dissemination through tissue. For the mitis group, prote- 
olysis appears to be a defining feature, as in addition to the 
enrichment, there were 17 genes born on the mitis branch 
that were also annotated with this term, suggesting that this 
characteristic has been retained throughout the group's evolutionary
history. The characteristic is likely important to the
group as a whole, since once it was gained, no species within
the group lost it (i.e., it was not subject to genomic streamlining).

Terms for response to antibiotic and antibiotic transport 
were enriched for the mitis group. Studies focusing on
antibiotic resistance for Streptococcus have indicated that in
general the viridans group shows substantial resistance to an- 
tibiotics (in particular beta-lactam antimicrobials) (Facklam 
2002). However, none of the genes born on the mitis 
branch were annotated with GO terms for response to anti- 
biotic and antibiotic transport, suggesting that resistance to 
antimicrobials was acquired more recently via LGT. In contrast, 
a notable feature of the mitis group was the enrichment for
N-acetyltransferase activity combined with the occurrence of ten
genes born on the group's branch with this term. Further
examination of the genes responsible for the enrichment
showed approximately half (50.7%) to be annotated as
GCN5-related N-acetyltransferases (GNAT). More specifically,
the number of GNAT genes for each of the eight mitis species
ranged from 5 to 22 (average 14). Some members of the
GNAT family of acetyltransferases confer resistance to amino- 
glycoside antibiotics (Vetting et al. 2005), and this resistance 
has been reported for several viridans species (Collatz et al. 
1984; Horaud and Delbos 1984). Our findings suggest that for
the mitis group, this resistance might be due in part to an 
intrinsic resistance of the group as a whole to 
aminoglycosides. 

The Sanguinis Group 

Of the terms enriched for the sanguinis group, the most frequent
was N-acetyltransferase, and there were nine genes
with this term born on the group's branch. Examination of 
the genes responsible for this enrichment showed that 78.3% 
were annotated as GNAT, and the number of GNAT genes for 
each of the three sanguinis species ranged from 20 to 30 
(average 24). These findings are similar to those for the
mitis group and suggest the same possibility of an intrinsic 
ability or potential for resistance to aminoglycoside antibiotics. 

The Salivarius Group 

Notable enrichment for the salivarius group was for urease 
activity, urea metabolic process, nickel cation binding, and 
both cobalt and calcium transport. The genes responsible 
for the urease and urea enrichment belong to an inducible 
urease operon, which allows for the metabolism of urea found 
in saliva (Chen et al. 1998). Specifically, the genes included 
those encoding all three of the structural proteins (alpha, beta, 
and gamma) and one of the four accessory genes. These 
genes were also born on the group's branch suggesting that 
this characteristic has been retained throughout the group's 
history. However, not all S. salivarius strains possess the 
operon. Geng et al. (2011) showed it to be absent in strain
SK126, suggesting a recent loss of the operon from some 
strains within this species (perhaps reflecting the early stages 
of streamlining). The urease enzyme requires nickel to func- 
tion (Chen et al. 1998), explaining the enrichment for nickel 
cation binding. In addition, Chen and Burne (2003) showed
that for S. salivarius, a three-gene cobalt ATP-binding
cassette (ABC) transporter immediately 3' to the
urease operon (ureMQO) was functioning as a nickel trans- 
porter, likely explaining the enrichment for cobalt transport. 
A search of the chromosome, wgs, and refseq_genomic data- 
bases at NCBI for the complete urease operon 
(ureABCEFGDMQO) showed it to be present only in the
three salivarius species. Interestingly, S. thermophilus has
been reported as nonureolytic in the literature (Facklam 
2002). One possible explanation for this might be that similar 
to S. salivarius, certain strains of S. thermophilus lack the 
operon. 

The salivarius group also showed enrichment for transpo- 
sase activity (molecular function) and DNA-mediated transpo- 
sition (biological process), with these terms born on the 
group's branch. The terms occurred relatively frequently, sug- 
gesting potential for high levels of recombination. However, 
the gene gain/loss analysis showed the group to rank last re- 
garding the number of genes exchanged among all 46 taxa 
(LGT counts for each group were normalized by dividing the 
count by the number of taxa in the group) (supplementary 
table S4, Supplementary Material online), suggesting that re- 
combination was more likely between strains within a species 
than among species. The pyogenic group showed a very sim- 
ilar pattern for these two terms. Both were enriched, occurred 
very frequently, and were born on the group's branch. This 
pattern suggests independent origins for the transposases in 
the salivarius and pyogenic groups and that these transpo- 
sases have tended to remain within their respective groups 
through evolutionary time, again suggesting that recombina- 
tion mediated by these genes was more likely to have oc- 
curred among more closely related sequences. 
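
The normalization used for the group ranking (LGT count divided by the number of taxa in the group) can be sketched as follows. The group labels and all counts here are hypothetical placeholders chosen for illustration; the actual values are those in supplementary table S4.

```python
# Hypothetical values for illustration only (real counts: supplementary table S4).
lgt_counts = {"group_X": 240, "group_Y": 120}
taxa_per_group = {"group_X": 8, "group_Y": 3}


def normalized_lgt(counts, sizes):
    """LGT count per taxon for each group."""
    return {group: counts[group] / sizes[group] for group in counts}


per_taxon = normalized_lgt(lgt_counts, taxa_per_group)

# Groups ordered from most to least LGT per taxon.
ranking = sorted(per_taxon, key=per_taxon.get, reverse=True)
print(per_taxon, ranking)
```

In this toy case the group with the larger raw count ranks lower after normalization, which is why the counts were scaled by group size before the groups were compared.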

The Pyogenic Group 

For the pyogenic group, the majority of enriched terms in- 
volved transport and metabolism; more specifically, the phos- 
phoenolpyruvate-dependent sugar phosphotransferase 
system (PTS) and the metabolism of carbohydrates. Terms 
for the PTS were among the most frequent and the genes 
with this term (and also several associated with carbohydrate 
metabolism) were born on the group's branch, suggesting 
that this characteristic has been retained throughout the 
group's history. The MCL clustering analysis delineated the 
genes with PTS GO terms into eight clusters. Six contained 
PTS permeases (IIA, B, and C), and two contained membrane
regulatory proteins. Most hexose sugars and disaccharides 
used by streptococci are taken up by the PTS (Price et al. 
2011), and their enrichment for the pyogenic group could 
reflect the wide range of hosts and environments inhabited 
by the species of this group. The pyogenic group also showed 
enrichment for the pathogenesis GO term. Although the 
genes responsible were distributed among all species in 
the group, they were not distributed evenly (supplementary 
table S5, Supplementary Material online). Not surprisingly,
S. pyogenes had the highest number of these genes (19).
The next two highest species were S. dysgalactiae subsp. equisimilis
(12) and S. equi subsp. equi (12), with the former
an emerging human pathogen (Brandt and Spellerberg
2009). In total, the MCL clustering analysis showed S. pyogenes
to share 11 pathogenesis genes with other pyogenic
species (supplementary table S5, Supplementary Material 
online). Streptococcus dysgalactiae subsp. equisimilis shared 
the most genes (eight), likely contributing to its emergence 
as a human pathogen. Streptococcus equi subsp. equi was the 
next highest with five. None of the genes with pathogenesis 
GO terms were born on the pyogenic group's branch, indicat- 
ing that these genes were acquired via LGT. 

The Bovis Group 

The bovis group showed enrichment for regulation of tran- 
scription (DNA dependent) and sequence-specific DNA-bind- 
ing transcription factor activity, with genes annotated with 
these two terms occurring in the highest frequency. These 
results suggest an enhanced regulatory ability for the bovis 
group, where strains can activate/deactivate multiple meta- 
bolic pathways to survive in resource poor environments 
such as the colon. Genes for both terms were also born on 
the group's branch, suggesting that this regulatory ability has 
been retained throughout the group's history. Examination of 
the genes responsible for this enrichment, that were also born 
on the group's branch, revealed six MCL clusters that in gen- 
eral exclusively contained transcriptional regulator genes from 
each of the bovis species (supplementary table S6, 
Supplementary Material online). Notably, one of the clusters 
contained genes from the multiple antibiotic resistance regu- 
lator (MarR) family of regulators. For Escherichia coli, this 
family of regulators has been shown to regulate genes respon- 
sible for resistance to antibiotics and other toxic chemicals 
(Alekshun and Levy 1999). Annotations for genes in the 
other clusters suggested that they belonged to the LysR, 
Crp-Fnr, and GntR families of regulators and were involved 
in regulation of anaerobic metabolic pathways and pyridoxine 
(vitamin B6) metabolism. The transcriptional regulator genes 
identified here represent useful targets for further study. For 
example, ascertaining whether the genes they regulate contribute
to virulence could be particularly valuable, given the possible
involvement of several of the regulators identified in
antibiotic resistance.

Conclusion 

Revealing a pattern of genomic expansion and streamlining 
for Streptococcus, our study adds to the emerging view that 
genomes evolve in this manner (Cuypers and Hogeweg 2012)
and lends support to the proposal that this process may be a
generic pattern of evolving systems (Cuypers and Hogeweg
2012). Furthermore, our results demonstrate that despite LGT
affecting a substantial proportion of the Streptococcus
pan-genome, many of the major groups have retained distinct
core characteristics since their formation.

Supplementary Material 

Supplementary tables S1-S7 and figures S1 and S2 are available
at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).